[SHUFFLE] [WIP] Prototype: store shuffle file on external storage like S3 #34864
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
@hiboyang, thanks for the work here! Could you create a design doc for this? That might help get more people's attention and make it easier for them to understand.
Yes, good suggestion. Will create a design doc.
Test build #146081 has finished for PR 34864 at commit
Quickly glanced through the code: it seems that for writing shuffle data we write locally first and then upload to S3, and similarly for reading shuffle data we download to a local temp file first and then read it. We should be able to write/read directly to/from S3, right?
Hi, All. Thank you!
BTW, for the record, Apache Spark 3.1+ already stores its shuffle files in external storage like S3 and reads them back from it:
- [SPARK-33545][CORE] Support Fallback Storage during Worker decommission (Apache Spark 3.1.0)
- [SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping SparkContext (Apache Spark 3.2.0)
- [SPARK-37509][CORE] Improve Fallback Storage upload speed by avoiding S3 rate limiter (Apache Spark 3.3.0)
It would be great not to ignore the existing Spark feature and to avoid over-claiming.
The same goes for dynamic allocation: Apache Spark has already been supporting Dynamic Allocation on K8s.
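For context on the fallback-storage feature referenced above, a minimal sketch of how it is typically enabled; the property keys come from SPARK-33545/SPARK-34142 as I understand them and should be verified against the Spark release in use:

```java
import org.apache.spark.SparkConf;

final class FallbackStorageConfSketch {
  // Sketch only: enable shuffle-block decommissioning with an S3 fallback store.
  // Config keys assumed from SPARK-33545/SPARK-34142; verify against your Spark version.
  static SparkConf build() {
    return new SparkConf()
        .set("spark.storage.decommission.enabled", "true")
        .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
        .set("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/spark-fallback/");
  }
}
```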
Thanks for looking! Yes, we should be able to write/read directly to/from S3. This PR is a prototype; the code and the performance of writing/reading shuffle data on S3 still need to be improved.
Right, Spark has shuffle tracking to support Dynamic Allocation on Kubernetes, but it does not work well when shuffle data is distributed across many executors (those executors cannot be released). The work here (storing shuffle data on S3) does not conflict with the worker decommission feature. The eventual goal is to store shuffle data on S3 or other external storage directly. Before getting there, people can still use the worker decommission feature.
You are completely wrong, because you already know about the worker decommission feature.
You should mention it explicitly in the PR description instead of misleading the users.
Hi Dongjoon, there seem to be some misunderstandings here. I am writing a design doc for this PR; hopefully it will help you understand more and address your questions.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #146147 has finished for PR 34864 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #146303 has finished for PR 34864 at commit
Added a design doc for this prototype.
Some quick comments on the PR:
  <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
You should pull in spark-hadoop-cloud and so indirectly get its shaded full AWS SDK. Yes, it's big, but it guarantees a consistent set of its own dependencies (HTTP client, Jackson, etc.), and because it includes support for services like STS and S3 events, it lets you add new features with guaranteed consistency of AWS artifacts.
Thanks for the suggestion! Yes, I was thinking of using that Hadoop library as well, but did not do it because I wanted to start small with this prototype. It sounds like a good idea to switch to the Hadoop library.
  ManagedBuffer managedBuffer = downloadFileWritableChannel.closeAndRead();
  listener.onBlockFetchSuccess(blockIdStr, managedBuffer);
} catch (IOException e) {
  throw new RuntimeException(String.format(
include the inner exception text in the message and supply the exception as the inner exception in the constructor
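As a sketch of what this asks for (the variable name blockIdStr is taken from the snippet above; the message format is only an example), chaining the caught IOException preserves both its message and its stack trace:

```java
import java.io.IOException;

final class FetchErrorSketch {
  // Keep the inner exception's text in the message and pass the exception itself as the cause.
  static RuntimeException wrap(String blockIdStr, IOException e) {
    return new RuntimeException(String.format(
        "Failed to fetch shuffle block %s: %s", blockIdStr, e.getMessage()), e);
  }
}
```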
  NioManagedBuffer managedBuffer = new NioManagedBuffer(byteBuffer);
  listener.onBlockFetchSuccess(blockIdStr, managedBuffer);
} catch (IOException e) {
  throw new RuntimeException(String.format(
Again, pass on inner exception details.
Yes, good catch! Will add inner exception!
import scala.collection.Iterator;

import javax.annotation.Nullable;
import java.io.*;
bit brittle against jvm releases adding new classes here
Good point, let me remove .* here.
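For illustration, the wildcard would be replaced with only the java.io classes the file actually uses; the exact set below is an assumption:

```java
import javax.annotation.Nullable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
```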
public static final String DEFAULT_AWS_REGION = Regions.US_WEST_2.getName();

private static TransferManager transferManager;
private static Object transferManagerLock = new Object();
final
yes!
  throw new RuntimeException(String.format(
      "Failed to download shuffle file %s", s3Url));
} finally {
  transferManager.shutdownNow();
What if the transfer manager is reused?
This was a code mistake. I should not shut down the transfer manager here. Will remove this.
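A rough sketch of a reused transfer manager, combining this comment with the earlier suggestion to make the lock field final; the class name and lifecycle here are assumptions about the intended design, not code from the PR:

```java
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

final class S3TransferManagerHolder {
  // Shared across all shuffle uploads/downloads; guarded by a final lock per the review.
  private static final Object transferManagerLock = new Object();
  private static TransferManager transferManager;

  static TransferManager get() {
    synchronized (transferManagerLock) {
      if (transferManager == null) {
        transferManager = TransferManagerBuilder.standard().build();
      }
      return transferManager;
    }
  }

  // Called once when the shuffle manager itself stops, not in per-download finally blocks.
  static void stop() {
    synchronized (transferManagerLock) {
      if (transferManager != null) {
        transferManager.shutdownNow();
        transferManager = null;
      }
    }
  }
}
```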
Obviously I am biased, but I believe that rather than trying to use the AWS APIs yourself, you should just use the Hadoop FileSystem APIs and interact with S3 through the s3a connector. For a high-performance upload of a local file, use ... This also means that you could write tests for the feature using file:// as the destination store and include these in the Spark module; if you design such tests to be overridable to work with other file systems, they could be picked up and reused as the actual integration test suites in an external module. And, because someone else owns the problem of the s3 connector binding, you get to avoid fielding support calls related to configuring the AWS endpoint, region, support for third-party S3 stores, qualifying AWS SDK updates, etc. Accordingly, I would propose
Getting integration tests set up is inevitably going to be somewhat complicated. I can provide a bit of consultation there.
Yes, these are great suggestions! Thanks again! I will find time to make this change, and may also reach out to you for consultation when adding integration tests :)
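To make the s3a suggestion concrete, here is a rough sketch of uploading a shuffle file through the Hadoop FileSystem API so the same code path works with s3a:// in production and file:// in tests; the root URI and file layout are assumptions, not part of this PR:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ShuffleUploadSketch {
  // Upload a local shuffle file to a configurable root,
  // e.g. "s3a://bucket/shuffle" in production or "file:///tmp/shuffle" in tests.
  static void upload(String rootUri, String localFile, String blockId) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(rootUri), conf);
    Path dest = new Path(rootUri + "/" + blockId);
    // copyFromLocalFile(delSrc, overwrite, src, dst)
    fs.copyFromLocalFile(false, true, new Path(localFile), dest);
  }
}
```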
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@hiboyang I looked at your work earlier this year and wanted to let you know that I used it as a basis for a shuffle plugin. Ultimately I decided to rewrite the plugin from scratch (except the tests) and modeled it on the design of the Spark shuffle manager. The code is available here: https://github.com/ibm/spark-s3-shuffle/ . It acts as an external Spark plugin and can be loaded into Spark binary releases as a plugin. I'm open to contributing this work back to Apache Spark if there is any interest.
Hi @pspoerri, great that you are working on this, and thanks for letting us know! I stopped working on my previous PR due to a change of work priorities, but I would still like to see people continue working in this area. There is big value in storing Spark shuffle data on S3: it saves cost and also makes Spark more resilient to disk errors. In my previous experiments, shuffle data on S3 had much worse performance. A lot of optimization is needed, e.g. S3 key prefix randomization to avoid S3 throttling and asynchronous S3 writes. Will be happy to hear your thoughts on this as well.
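To illustrate the key-prefix-randomization idea mentioned above (the layout and hashing scheme are hypothetical, not what this PR implements): spreading objects across many prefixes helps S3 partition the request load instead of throttling one hot prefix.

```java
final class ShuffleKeySketch {
  // Hypothetical key layout: a short hash-derived prefix spreads objects across S3 partitions.
  static String shuffleKey(String appId, int shuffleId, int mapId, int reduceId) {
    String name = String.format("%s/shuffle_%d_%d_%d.data", appId, shuffleId, mapId, reduceId);
    String prefix = Integer.toHexString(name.hashCode() & 0xff); // 256 possible prefixes
    return prefix + "/" + name;
  }
}
```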
+1, would be great to get this working and part of the Spark ecosystem.
@michaelbilow hadoop s3a is on the v2 SDK; the com.amazonaws classes are not on the CP and Amazon is slowly dropping support. You cannot, for example, use the lower-latency S3 Express stores with it. Like I say: I think you would be better off using the Hadoop file system APIs to talk to S3. If there are aspects of S3 storage which aren't available through the API, or are only available very inefficiently due to the effort to preserve the POSIX metaphor, then let's fix the API so that other stores can offer the same features and other apps can pick them up. For example, here's our ongoing delete API for Iceberg and other manifest-based tables
@steveloughran How do I call the Hadoop APIs from Spark? Can you point me to a package? Another issue is that Hadoop wants to know the size of every file it wants to read. While this makes sense for formats like Parquet, where the footer is located in the last few bytes of the file, it does not make sense for shuffle, where you know the exact block/file you want to read.
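On the file-size point: if the caller already knows the block length, it can be passed to the Hadoop openFile() builder so the s3a connector does not need an extra HEAD request. This is a sketch assuming Hadoop 3.3+; the fs.option.openfile.length option key in particular is an assumption and should be verified against the Hadoop version in use:

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ShuffleReadSketch {
  // Open a shuffle block whose length the caller already knows, passing the length
  // as a hint so the connector can skip the size lookup.
  static FSDataInputStream open(FileSystem fs, Path path, long knownLength) throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.length", Long.toString(knownLength)) // option key assumed
        .build()
        .get();
  }
}
```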
What changes were proposed in this pull request?
This PR (design doc) adds support for storing shuffle files on external shuffle storage like S3. It helps Dynamic Allocation on Kubernetes: the Spark driver can release idle executors without worrying about losing shuffle data, because the shuffle data is stored on external shuffle storage separate from the executors.
This could be viewed as a followup work for https://issues.apache.org/jira/browse/SPARK-25299.
There is a previous Worker Decommission feature (SPARK-33545), which is a great feature that copies shuffle data to fallback storage like S3. People appreciate that work for addressing the critical issue of handling shuffle data on Spark executor termination. The work in this PR does not intend to replace that feature. The intent is to get further discussion about how to save shuffle data on S3 during normal execution time.
Why are the changes needed?
To better support Dynamic Allocation on Kubernetes, we need to decouple shuffle data from Spark executors. This PR implements another Shuffle Manager and supports writing shuffle data to S3.
Does this PR introduce any user-facing change?
Yes, this PR adds two Spark configs like the following to plug in the StarShuffleManager and store shuffle data in the provided S3 location.
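The concrete config values are elided above; purely as an illustration (spark.shuffle.manager is a standard Spark property, while the class name and S3-location key below are hypothetical stand-ins for whatever this PR defines), plugging in a custom shuffle manager might look like this:

```java
import org.apache.spark.SparkConf;

final class StarShuffleConfSketch {
  static SparkConf build() {
    return new SparkConf()
        // Standard Spark property for selecting a ShuffleManager implementation.
        .set("spark.shuffle.manager", "org.apache.spark.shuffle.StarShuffleManager") // class name assumed
        // Hypothetical property for the S3 root of shuffle data; the real key is defined by this PR.
        .set("spark.shuffle.star.rootDir", "s3a://my-bucket/spark-shuffle");
  }
}
```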
How was this patch tested?
Added a unit test for StarShuffleManager. A lot of classes are copied from Spark, so tests were not added for those classes. We will work with the community to get feedback first, then work on removing the copied/duplicated code.